{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Overview - Airbnb Pricing Predictions\n",
    "<hr/>\n",
    "\n",
    "### Introduction\n",
    "\n",
    "\n",
    "Since its founding, Airbnb has hosted over 60 million people in 34,000 cities across the world and is continuing to grow quickly. Airbnb provides a convenient source of income for people who have otherwise vacant space and for guests looking for affordable and convenient housing options. With any service, trying to monitor and understand the underlying pricing dynamics of the Airbnb market is very important both for hosts and guests. As users continue to grow on both the supply and demand side, homeowners may find it hard to properly price their property. Airbnb has recognized this and conducted considerable research into suggesting pricing from a [supply side standpoint](http://spectrum.ieee.org/computing/software/the-secret-of-airbnbs-pricing-algorithm). \n",
    "\n",
    "\n",
    "We seek to analyze over 27,000 listings in the NYC area in order to better understand how the use of listing attributes such as bedrooms, location, ratings, and more can be used to accurately predict the optimal listing price both for the host and guest. Holiday and seasonality is another useful component that can attract more customers and drive higher prices, but it is unclear how much of a premium one should pay per holiday. With better price suggestion estimates, Airbnb home providers can reach an equilibrium price that optimizes profit and affordability. The objective of this project is to build a model that predicts the optimal price of a property taking into account listing features and seasonality. The end goal is so users can understand what features of an AirBnB listing are most important as well as how prices should be fluctuating based on seasonality.\n",
    "\n",
    "\n",
    "### Data Cleaning\n",
    "\n",
    "\n",
    "The main files for data cleaning: ```data-cleaning-listings.ipynb, seasonality-cleaning-WJ.ipynb, zipcode-clustering.ipynb```. From the available 52 available predictors, 14 intuitive predictors were selected. After cleaning the data (e.g. accounting for missing values, invalid entry formatting, inconsistencies in the data) we retained 24,050 records out of the original 27,392.\n",
    "\n",
    "\n",
    "### Data Exploration\n",
    "\n",
    "\n",
    "The main files for data exploration: ```data-exploration-listings.ipynb, seasonality-exploration-WJ.ipynb```. Our analysis was focused on Airbnb listings in NYC. Based on a scatter_matrix plot containing all predictors, there was no strong evidence for collinearity amongst the predictors. However, many of the predictors as well as price itself was strongly skewed right. To account for this, price was log transformed; however, this is certainly one inherent difficulty having to account for a skewed predictor in fitting a linear regression model on prices. From the location exploration graphs, we see that our data was fairly concentrated in the Manhattan / Brooklyn area, with various size clusterings across the city itself (Figure 1).\n",
    "\n",
    "\n",
    "![Zipcode Heat Map](../img/zipcode-heat-map.jpg)\n",
    "\n",
    "\n",
    "### Data Modeling\n",
    "\n",
    "\n",
    "The main files for data modeling: ```data-modeling-baseline.ipynb, data-modeling-baseline-cluster.ipynb, seasonality-exploration-WJ.ipynb```. Various regression models were fit such as Linear Regression, Ridge Regression, and Lasso Regression. While these models seemed to indicate a poor fit based on their $R^2$ score (~0.3 for a test dataset score), analyzing by Median Absolute Error proved to be significantly more insightful. Evaluating by Median Absolute Error is a more robust method than RSS in this setting, as it is less affected by outliers (an issue that was raised in predicting pricing especially for Airbnb listings) and yields a more intuitive score that is easily translatable to pricing. A Ridge Regression obtained an optimal Median Absolute Error of \\$21.43. However, when analyzing by single house listing data, a Bayes Ridge Regression obtained an optimal Median Absolute Error of \\$19.43 (Figure 2). \n",
    "\n",
    "\n",
    "![Median Absolute Error](../img/model-median-absolute-error.png)\n",
    "\n",
    "\n",
    "Models such as Polynomial RidgeCV and ensemble methods such as Random Forest Regressor proved to be extremely computationally expensive and with some preliminary tuning did not perform as well as models such as RidgeCV. Further exploration can be done in regards to more comprehensively tuning these models, but will require additional methods to speed up CPU and time costs.\n",
    "\n",
    "\n",
    "#### Incorporating Seasonality\n",
    "After cleaning the seasonality data, the main source of analysis on the seasonality data is in the ```seasonality-exploration-WJ.ipynb```. Overall, looking at seasonality was a worthwhile endeavor. There were noticeable changes in the prices depending on the day of the week. On average, Friday and Saturday listings proved to have the highest price on Airbnb. Both are priced on average 103% the Monday price. Tuesday and Wednesday are on average the least expensive, both around 99% monday prices. Overall, the results suffered from greater error than our original RidgeCV regression, as we experienced a median absolute error of $72.10. This is for three reasons -- many of the listings still do not incorporate day-variations in their prices, inaccuracies within the original RidgeCV predictions, and large rather than small price difference between days of the week. However, this area perhaps is the most promising because it shows that many Airbnb listers are not taking advantage of dynamic pricing by the day of the week, something that is important to establish optimal pricing. Already, there are some promising results -- there is evidence to suggest that hosts should price Friday and Saturday the highest and Tuesday and Wednesday the lowest. With further refinements to our model, such as looking at various seasonal time series models, people may be able to look at the trends and price their Airbnb listings more appropriately and with more concrete percentage change suggestions.\n",
    "\n",
    "\n",
    "### Conclusion\n",
    "\n",
    "Using a Ridge Regression we were able to fit a model that obtained a Median Absolute Error of \\$21.43 for all listing data and \\$19.43 on single bedroom listings, beating that of a previous study that assessed single bedroom listing data. For a practical predictor to be used in practice, future work will need to be done to further explore and build more suitable models. Predicting a significantly skewed right response variable, price, yields a set of challenges that need to be addressed. Rather than predicting a specific price, future work could consider predicting a bucket price range that a listing would fall into. By turning this into a classification problem, accuracy may very well increase and this model would be more useful from a practical standpoint. Additionally, by predicting prices in bucket range, a host could have a greater degree of autonomy and perhaps confidence in selecting the price that they deemed fit. Finally, further analysis on the trends for seasonality could yield an improved model. There are certainly promising patterns in seasons and day of the week that play a role in determining the optimal Airbnb listing price."
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [Root]",
   "language": "python",
   "name": "Python [Root]"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
